Abstract

Item parameter estimates vary for a variety of reasons, including estimation error, characteristics
of the examinee samples, and context effects (e.g., item location effects, section location effects,
etc.). Although we expect variation based on theory, there is reason to believe that observed 
variation in item parameter estimates exceeds what theory would predict. This study examined 
both items that were administered linearly in a fixed order each time that they were used and 
items that had appeared in different adaptive testing item pools. The study looked at both the 
magnitude of variation in the item parameter estimates and the impact of this variation in the 
estimation of test-taker scores. The results showed that the linearly administered items exhibited
remarkably small variation in parameter estimates over repeated calibrations. Similar findings 
with adaptively administered items in another high stakes testing program were also found when 
initial adaptively based item parameter estimates were compared with estimates from repeated 
use. The results of this study also indicated that context effects played a more significant role in 
adaptive item parameters when the comparisons were made to the parameters that were initially 
obtained from linear paper-and-pencil testing. 

Computer-based testing, and adaptive testing in particular, typically depends upon item
response theory (IRT). The advantages of IRT are well-known through the testing literature and 
have fueled the transition of computerized adaptive testing (CAT) from a research interest to a 
widely used practical application. However, the introduction of computer-based testing in high 
volume, high stakes settings has presented new challenges to testing practitioners. In most 
computer-based testing programs, it is necessary to administer items repeatedly over time. This 
continuous item exposure raises security concerns that were not fully appreciated by researchers 
when the theory and practice of CAT were first developed. 

In most CAT programs, steps are taken to protect the integrity of item pools through 
strategies such as item exposure control, pool rotation, and accelerated item development (Way, 
1998). Despite such efforts, maintaining CAT programs remains difficult because adaptive 
algorithms tend to select the most highly discriminating items. Efforts to increase item 
development bring increased costs and diminishing returns. As these items become exposed and 
are retired from use, developing sufficient replacement items of the same quality is very difficult: 
Three or four items may need to be written to find a suitable replacement. Furthermore, the lag 
time between the initial writing of the items and use of the items in an operational CAT pool is 
usually significant, as items must be pretested, calibrated, and evaluated before they may be used 
operationally. 

Recently, researchers at ETS have begun exploring an approach to adaptive testing that 
could address some of the challenges of item exposure and pool maintenance (Bejar et al., 2002).
Bejar (1991) referred to this approach as generative testing. More recently it has been called item 
modeling. The essence of item modeling is to create items from explicit and principled rules. The 
approach has roots in computer-assisted instruction and domain-referenced testing (Hively,
1974). The obvious vehicle for item modeling is the computer, and successful applications of
automated item generation have been reported by a number of researchers (Embretson, 1999; 
Irvine, Dunn, & Anderson, 1990; Irvine & Kyllonen, 2001). 

Although the capability to develop item models and generate items automatically is more 
easily established for some item types than for others, the potential utility of automated item 
generation for supporting computer-based testing is obvious. An effective item model provides 
the basis for a limitless number of items, each of which is assumed to share the same content and 
statistical characteristics. In CAT, the adaptive algorithm could choose an item model based on 
the common psychometric characteristics, and the actual instance of the item would be generated
at the time of delivery. Such an approach was referred to as on-the-fly adaptive testing by Bejar 
et al. (2002). They carried out a feasibility study of a CAT application where item models were 
utilized and concluded that the adaptive generative model they employed was both technically 
feasible and cost effective. 

From a traditional IRT perspective, the use of item models with adaptive testing seems 
far-fetched. In fact, much of the IRT literature in recent years has centered on item parameter 
estimation and parameter recovery, the idea being that successful applications of IRT depend 
upon well-estimated parameters. The notion that one could use a single set of IRT estimates to 
characterize all of the items generated from a particular model directly contradicts the goal of 
obtaining accurate item parameter estimates. However, such a perspective does not account for 
the variation that may occur in student scores due to a variety of effects that influence how test 
items are responded to in the real world. These include context effects, item position effects, 
instructional effects, variable sample sizes, and other sources of item parameter drift that are 
typically not formally recognized or controlled for in the context of CAT. 

Several researchers have documented the existence and influence of such item level 
effects. Sireci (1991) looked at the effect of sample sizes on the stability of IRT item parameter 
estimates. Kingston and Dorans (1984) described such effects in equating the paper-and-pencil 
GRE. Leary and Dorans (1985) and Brennan (1992) reviewed literature related to context effects 
and provided guidelines on how such effects might be minimized. Zwick (1991) described a case 
study of how context effects created an anomaly in the Reading test scores on the National 
Assessment of Educational Progress (NAEP). Divgi (1986) documented changes in item 
parameter estimates in an early application of the Armed Services Vocational Aptitude Battery 
(CAT-ASVAB). Several researchers have investigated causes of item parameter drift in testing 
programs that utilized IRT in test construction and equating over time (Eignor & Stocking, 1986; 
Kolen & Harris, 1990; Sykes & Fitzpatrick, 1992; Way, Carey, & Golub-Smith, 1992).

In considering the viability of item models for CAT, we recognize that variation within 
models introduces a source of errors that is not present in traditional CAT. However, the 
repeated use of the same items across different CAT pools also introduces a source of errors that 
is tolerated but not accounted for. The purpose of this study was to investigate and to quantify 
the error that is currently tolerated in item parameter estimates for different sets of items used in 
computer-based testing. The study examined items that were administered repeatedly to different
examinee samples over time. We examined items that, each time they were used, were
administered linearly in a fixed order and also items that had appeared in different adaptive
testing item pools. We examined both the magnitude of variation in the item parameter estimates
and the impact of this variation on the test takers' scaled or reported scores.

Case Study 1: Linear Administration of Items 



Data 

In order to carry out the investigation of the stability of parameter estimates in linearly
administered tests, two sets of items from a high stakes admissions test were chosen. The first set
was composed of 28 items from the Quantitative (QNT) measure, while the second set consisted
of 30 items from the Verbal (VBL) measure of the same test. Since ability distributions for the
Quantitative measure are known to change more rapidly than the other measures, a greater
variation in the parameter estimates was expected for that measure.

The items contained in the two sets come from actual test administrations in which these
items were used as anchors to place parameter estimates on the base scale for other items in the
linearly administered pretest sections. In every online pretest calibration for these CAT
programs, anchor items are administered as similarly as possible to pretest items. The
composition of the anchor set mirrors the pretests in terms of psychometric and content
characteristics; the number of items in the pretest and anchor set is the same. The Verbal and
Quantitative anchor items evaluated in the study were used over a 2-year period and were
calibrated for each administration of the corresponding pretest measure. Thus nine repeated
calibrations were available for each anchor item. The average item parameter estimates for both
Quantitative and Verbal measures are presented in Table 1.

The calibration samples were randomly obtained by spiraling the pretest forms across
examinees. The sample sizes used to calibrate each item varied from 627 to 2,305 for the
Quantitative measure and 830 to 2,284 for the Verbal measure. The details of sample sizes used
for each calibration are presented in Table A1. The perfect response patterns were excluded from
each of the response sets; the resulting sample sizes are presented in the last column of that table.

Table 1

Average Item Parameter Estimates (a, b)

Measure  Calibration  a-parameter          b-parameter
                      Mean      St. dev    Mean      St. dev
QNT      1            0.842     0.342      -0.020    1.163
         2            0.852     0.348      -0.008    1.145
         3            0.826     0.340      -0.028    1.166
         4            0.764     0.365      -0.042    1.341
         5            0.764     0.297      -0.030    1.256
         6            0.779     0.272       0.016    1.208
         7            0.818     0.339      -0.047    1.235
         8            0.755     0.335      -0.182    1.305
         9            0.775     0.299      -0.036    1.175
VBL      1            1.003     0.292      -0.090    1.150
         2            0.983     0.290      -0.034    1.205
         3            0.954     0.279      -0.065    1.222
         4            0.967     0.270       0.035    1.229
         5            1.024     0.291       0.045    1.124
         6            1.129     0.313       0.046    1.059
         7            0.970     0.278      -0.010    1.114
         8            1.020     0.301       0.010    1.107
         9            0.954     0.275      -0.061    1.186

Parameter Estimation Methodology 

The item parameter estimates were obtained using the software LOGIST (Wingersky, 
Patrick, & Lord, 1988). LOGIST uses the joint maximum likelihood estimation methodology to 
estimate item parameters, keeping the ability parameters fixed, while formulating item parameter 
estimates. The ability parameters in this case were the actual ability estimates obtained on the 
operational section of the test. The estimates on the linearly administered items were then 
subjected to scaling using the test characteristic curve methodology proposed by Stocking and 
Lord (1983). In this study, the stability of estimates on both sets of anchor items was investigated 
after the scaling was carried out. 
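
The test characteristic curve criterion behind this scaling step can be illustrated with a minimal sketch. It is not the LOGIST or operational scaling code: it assumes a three-parameter logistic model with scaling constant D = 1.7 (a value this report does not state), and it replaces a proper numerical optimizer with a coarse-to-fine grid search. The function names, the theta grid, and the search settings are hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant; not stated in the report


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: expected number-right score at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def sl_loss(A, B, ref_items, new_items, thetas):
    """Squared gap between the reference TCC and the rescaled new TCC.

    The linear transformation theta* = A * theta + B maps a -> a / A and
    b -> A * b + B, and leaves c unchanged.
    """
    rescaled = [(a / A, A * b + B, c) for a, b, c in new_items]
    return sum((tcc(t, ref_items) - tcc(t, rescaled)) ** 2 for t in thetas)


def fit_scaling(ref_items, new_items, thetas, rounds=6):
    """Coarse-to-fine grid search for (A, B).

    A simple stand-in for a real optimizer: it only reaches slopes and
    intercepts within roughly +/- 0.6 of the starting point (1, 0).
    """
    A, B, step = 1.0, 0.0, 0.5
    for _ in range(rounds):
        candidates = [
            (A + i * step / 5.0, B + j * step / 5.0)
            for i in range(-5, 6)
            for j in range(-5, 6)
            if A + i * step / 5.0 > 0.05  # keep the slope positive
        ]
        A, B = min(
            candidates,
            key=lambda ab: sl_loss(ab[0], ab[1], ref_items, new_items, thetas),
        )
        step /= 4.0
    return A, B
```

Given anchor-item estimates from a reference calibration and from a new calibration, `fit_scaling` returns the slope and intercept that bring the new estimates onto the reference scale under these assumptions.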

Analyses

In order to look at the general trends in the variation of individual parameter estimates, the
a, b, and c parameters were plotted for each item across calibrations. The purpose of this analysis
was simply to get an idea of any directional change that could occur in some items over time. In 
order to look at the effect of parameter estimate variation on the probability of getting an item 
correct, the item characteristic curves were examined for each item across nine calibrations for 
both measures. The weighted Root Mean Squared Errors (RMSE) were then computed between 
the item characteristic curves for the various calibrations in relation to the first calibration. In 
other words the first calibration was chosen as a point of reference for all comparisons in this 
case. The RMSE in this case is defined as 



\[ \mathrm{RMSE}_{ic} = \sqrt{\sum_{j=1}^{n} w_j \left[ P_{ic}(\theta_j) - P_{i1}(\theta_j) \right]^2} \tag{1} \]

where $P_{ic}(\theta_j)$ is the probability of getting an item (i) correct in a calibration (c) at an ability level
$\theta_j$. The weight $w_j$ is the proportion of examinees out of the total number of examinees and n is
the number of ability levels. The ability levels (and the corresponding weights) were derived
from the reference paper-and-pencil (P & P) base form ability distribution for this particular
program on the number-right scale. The number-right score levels ranged from 10 to 59 for
Quantitative and 15 to 75 for Verbal, resulting in 11 and 13 ability levels for the two measures,
respectively. The levels were then converted on to the theta metric as listed in Table 2. These
ability levels ranged from -3.839 to 3.546 for Quantitative and -5.855 to 4.881 for Verbal.

This index was used in similar research performed at ETS where the item characteristic 
curves (ICCs) obtained on different calibrations were compared (Guo, Stone, & Cruz, 2001; 
Rizavi & Guo, 2002). The RMSEs were then plotted for each item across calibrations to capture 
variation for items. 
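
As an illustration of the index, the following sketch computes the weighted RMSE between two item characteristic curves as the square root of the weighted sum of squared probability differences over the ability levels. It assumes a three-parameter logistic model with D = 1.7, which is not stated in this report, and uses the Quantitative ability levels and weights listed in Table 2. The function names are hypothetical, and the item parameters in the usage note are made up for illustration.

```python
import math

D = 1.7  # assumed logistic scaling constant; not stated in the report

# Quantitative ability levels and weights as listed in Table 2.
QNT_THETA = [-3.839, -2.184, -1.381, -0.812, -0.348, 0.053,
             0.427, 0.807, 1.242, 1.882, 3.546]
QNT_WEIGHT = [0.001, 0.029, 0.100, 0.158, 0.172, 0.155,
              0.125, 0.106, 0.094, 0.055, 0.003]


def icc(theta, a, b, c):
    """Item characteristic curve under the assumed 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def weighted_icc_rmse(ref_params, other_params,
                      thetas=QNT_THETA, weights=QNT_WEIGHT):
    """Weighted RMSE between the ICCs of two calibrations of one item.

    ref_params and other_params are (a, b, c) tuples; ref_params plays
    the role of the first (reference) calibration.
    """
    total = 0.0
    for theta, w in zip(thetas, weights):
        diff = icc(theta, *other_params) - icc(theta, *ref_params)
        total += w * diff * diff
    return math.sqrt(total)
```

For example, `weighted_icc_rmse((0.84, -0.02, 0.20), (0.80, 0.05, 0.18))` compares a hypothetical first calibration of an item with a hypothetical later one; identical parameter sets give a value of zero.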

Table 2

Ability Levels and Corresponding Weights for Quantitative and Verbal Measures

Level    Quantitative          Verbal
         Ability    Weight     Ability    Weight
1        -3.839     0.001      -5.855     0.000
2        -2.184     0.029      -3.355     0.006
3        -1.381     0.100      -2.337     0.019
4        -0.812     0.158      -1.635     0.049
5        -0.348     0.172      -1.074     0.111
6         0.053     0.155      -0.585     0.175
7         0.427     0.125      -0.127     0.195
8         0.807     0.106       0.329     0.163
9         1.242     0.094       0.800     0.130
10        1.882     0.055       1.298     0.084
11        3.546     0.003       1.856     0.051
12                              2.608     0.017
13                              4.881     0.000



Another interesting way to look at the variation is to estimate the variance-covariance
matrix of item parameter estimates. Several alternatives are available for computing the sampling
variances of item parameter estimates. The first is to use standard large-sample theory, which
holds that the asymptotic variances of $\langle \hat{a}, \hat{b}, \hat{c} \rangle$ are given by the inverse of the 3x3 Fisher
information (I) matrix evaluated at the true parameter values <a, b, c> (Lord, 1980; Hambleton,
Swaminathan, & Rogers, 1991), defined as



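\[ \Sigma_i = \begin{bmatrix} I_{aa} & I_{ab} & I_{ac} \\ I_{ab} & I_{bb} & I_{bc} \\ I_{ac} & I_{bc} & I_{cc} \end{bmatrix}^{-1} \tag{2} \]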



The diagonal elements of the matrix represent the information associated with each
parameter. The problem, of course, is that the true parameters are unknown. Our best
approximation is then to evaluate information at the values of the parameter estimates $\langle \hat{a}, \hat{b}, \hat{c} \rangle$
and hope that these are reasonably close to the true values. The estimates were averaged across
nine calibrations to obtain the best estimate for each item. It is, however, true that the
item parameter estimates are often constrained to avoid taking on inappropriate values (e.g.,
negative a-parameters or c-parameters outside the range [0, 1]). Such constraints are liable to
upset asymptotic theory and render the sampling variance approximations less valid.
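
The large-sample calculation can be sketched as follows. This is an illustrative computation rather than the procedure used in the study: it assumes a three-parameter logistic model with D = 1.7, treats abilities as known, and represents the examinee distribution by a discrete set of ability levels and weights (for instance those of Table 2). The function names and the singularity threshold are hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant; not stated in the report


def info_matrix(a, b, c, thetas, weights, n_examinees):
    """3x3 Fisher information for (a, b, c), treating abilities as known.

    Each entry is N * sum_j w_j * (dP/ds)(dP/dt) / (P * (1 - P)),
    evaluated at the supplied parameter values.
    """
    info = [[0.0] * 3 for _ in range(3)]
    for theta, w in zip(thetas, weights):
        logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
        p = c + (1.0 - c) * logistic
        slope = (1.0 - c) * logistic * (1.0 - logistic)
        grad = (
            D * (theta - b) * slope,  # dP/da
            -D * a * slope,           # dP/db
            1.0 - logistic,           # dP/dc
        )
        scale = n_examinees * w / (p * (1.0 - p))
        for s in range(3):
            for t in range(3):
                info[s][t] += scale * grad[s] * grad[t]
    return info


def invert3(m):
    """Inverse of a 3x3 matrix via the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:  # crude absolute threshold, for illustration only
        raise ValueError("information matrix is numerically singular")
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[x / det for x in row] for row in adj]


def asymptotic_covariance(a, b, c, thetas, weights, n_examinees):
    """Asymptotic variance-covariance matrix: inverse of the information."""
    return invert3(info_matrix(a, b, c, thetas, weights, n_examinees))
```

The diagonal of the returned matrix gives the model-based sampling variances of the a, b, and c estimates for one item at the assumed sample size.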

In the current situation, a second means is available for estimating sampling variation. 
The items under study were administered on nine separate occasions, and parameter estimates 
were separately obtained from each administration sample. The observed variation across these 
estimates is therefore an empirical estimate of the sampling fluctuation of the parameter 
estimates defined as. 



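\[ S_i = \begin{bmatrix} \sigma_a^2 & \sigma_{ab} & \sigma_{ac} \\ \sigma_{ab} & \sigma_b^2 & \sigma_{bc} \\ \sigma_{ac} & \sigma_{bc} & \sigma_c^2 \end{bmatrix} \tag{3} \]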



In theory, and under all of the assumptions of that theory, the empirical and asymptotic 
estimates of sampling variation should be very similar. However, the empirical variances are 
only based on nine observations and may not be very stable. Both asymptotic and empirical 
sampling variance estimates are therefore problematic to some extent. It was therefore decided to 
repeat the analyses with both. 
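
The empirical alternative amounts to a sample variance-covariance matrix of the (a, b, c) estimates across the repeated calibrations of one item. The sketch below shows one way to compute it; the function name is hypothetical, and the use of the n - 1 denominator is an assumption, since the report does not say which denominator was used.

```python
def empirical_covariance(estimates):
    """Sample variance-covariance matrix of (a, b, c) across calibrations.

    `estimates` is a list of (a, b, c) tuples, one per calibration of the
    same item. Uses the n - 1 denominator (an assumption).
    """
    n = len(estimates)
    if n < 2:
        raise ValueError("need at least two calibrations")
    means = [sum(row[k] for row in estimates) / n for k in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for row in estimates:
        dev = [row[k] - means[k] for k in range(3)]
        for s in range(3):
            for t in range(3):
                cov[s][t] += dev[s] * dev[t] / (n - 1)
    return cov
```

With nine calibrations per item, as in this case study, each entry of the matrix rests on only nine observations, which is why these empirical estimates may not be very stable.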

The last and the most affirming set of analyses was performed to look at the effect of
variation in the item parameter estimates on the actual reported scores. The responses of
examinees on the anchor items were selected for the nine sets of calibrations on both measures.
A typical ability distribution for the examinees during an administration is given in Figure A5 for
both Quantitative and Verbal. Each response set was then scored using the set of item parameter
estimates obtained on it during the calibration process and then using the parameter estimates
obtained using each of the other eight sets of responses. Hence 9 sets of scores were obtained
from each response set. A grand total of 81 sets of scores were produced from the total of 9
response sets. The scoring was carried out using maximum likelihood estimation methodology
(Lord, 1980; Hambleton et al., 1991). RMSE statistics between each set of the baseline theta
estimates (or scores obtained using the set of item parameter estimates obtained on the same
response set during the calibration process) and the estimates from each of the other eight sets were
then computed. The statistic was defined as



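\[ \mathrm{RMSE}_{ck} = \sqrt{\frac{1}{N_k} \sum_{j=1}^{N_k} \left( \hat{\theta}_{ckj} - \hat{\theta}_{kkj} \right)^2} \tag{4} \]

where $\hat{\theta}_{kkj}$ is the ability estimate obtained for an examinee j on examinee set (or response set) k
using item parameter estimates obtained by calibrating response set k. On the other hand, $\hat{\theta}_{ckj}$ is
the ability estimate for an examinee j on examinee set k, using item parameter estimates obtained
by calibrating response set c. Here $N_k$ is taken to denote the number of examinees in response set k.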

The ability estimates were then mapped on to the reported score scale and the
distributions of differences (Score_ck - Score_kk) for each of the 81 scenarios were plotted. The
differences were expressed on the operational or reported score scale where the reported scores
for this particular program are expressed in 10-point intervals.
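
The scoring comparison can be sketched in a few lines. The sketch is illustrative only: it assumes a three-parameter logistic model with D = 1.7, finds the maximum likelihood ability estimate by a bounded grid search rather than by the operational scoring routine, and uses hypothetical function names and bounds. The conversion from theta to the reported score scale is program-specific and is not shown.

```python
import math

D = 1.7  # assumed logistic scaling constant; not stated in the report


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = p3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def mle_theta(responses, items, lo=-4.0, hi=4.0, steps=800):
    """Grid-search ML ability estimate, bounded to [lo, hi] (assumed bounds)."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


def score_rmse(theta_own, theta_other):
    """RMSE between ability estimates from own-set and other-set parameters."""
    n = len(theta_own)
    return math.sqrt(
        sum((o - s) ** 2 for s, o in zip(theta_own, theta_other)) / n
    )
```

Scoring every examinee in response set k twice, once with the parameters calibrated on set k and once with those calibrated on set c, and passing the two lists of estimates to `score_rmse` gives one of the 81 comparisons.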

Case Study 2: Adaptive Administration of Items

Estimation Methodology

The second part of this investigation was carried out on a set of adaptively administered
operational items from another high stakes admissions test. This particular program uses the item
specific prior methodology with a proprietary version of the computer software PARSCALE
(Muraki & Bock, 1999). This methodology allows unique multivariate normal distributions to be
used as prior distributions for the parameters of each item (Swaminathan & Gifford, 1986; Folk
& Golub-Smith, 1996). These item specific priors are actually the mean estimates of the (b, a, c)
parameters as well as the asymptotic variance-covariance matrix specified as (Intercept, a, c).
These priors are used for the CAT operational items and are different for each item, as they are
item specific. On the other hand, global priors are used for the pretest items and are the same for
all pretest as well as anchor items. The global prior distributions for the a-parameter are
approximated by a lognormal distribution, b-parameter distributions are approximated by a normal
distribution, and the c-parameter prior distributions are approximated by a beta distribution. All
pretest, anchor, and CAT items are calibrated together for an administration. In this case, pretest or
anchor items are actually embedded in the operational test. This is unlike the previous case, where a
pretest or anchor set is offered as a separate section. Since the priors on the CAT items are strong,
their estimates hardly move away from their original values. The CAT items, therefore, set the scale,
thus putting all items on the same scale. Once calibrated, the pretest item parameter estimates
are stored in the item bank to be used in subsequent pools, while the operational item parameter 
estimates are not used further. This methodology has been shown to be effective in utilizing data 
from operational items that do not have a uniform distribution of ability, since they are 
administered adaptively. 

Data 

The data for this investigation came from the Quantitative measure of an adaptively 
administered high stakes admissions test. Items that had already appeared in operational pools 
and had been included in several pretest calibrations to hold the scale (with item specific priors 
on them) were identified. In order to obtain relatively uniform ability distributions, 30 items that 
were slightly easy, mid-difficulty, or adequately difficult and had sample sizes larger than 500 
associated with them were chosen. The item parameter estimates for these items were originally 
obtained when they were pretested in P & P administrations before the introduction of CAT. The 
mean and standard deviations for the original a, b, and c parameters are given in Table 3.

Table 3

Mean and Standard Deviations for the a, b, and c Parameters (Original P & P)

         a       b       c
Mean     1.07    0.23    0.16
Stdev    0.19    0.72    0.05



All chosen items had appeared in several pools and had been included in at least 8 
calibrations. The number of calibrations available on these items is given in Table 4. 

The ability distributions of examinees who received these items in each calibration were 
inspected to make sure that the range of examinee abilities for each of these items was not 
restrictive. For the purpose of this investigation, all calibrations were rerun with the following
modification: the item specific priors were removed and global priors were imposed on these
CAT items, so that they were treated like other pretest items. The modified requests for the
calibration were resubmitted using ETS-specific software called GENASYS, which uses
PARSCALE for calibration. Items were then calibrated in this modified way and new parameter
estimates were obtained.



Table 4

Number of Calibrations

No. of items    No. of calibrations
5               8
12              9
7               10
6               11



Analyses 

Similar to the previous case study, the item characteristic curves were examined for each
item. The weighted RMSEs were then computed between the ICCs from the first calibration
and those from each of the other calibrations, as discussed in the previous study. The first
calibration was arbitrarily chosen as the point of comparison.

The next part of the analyses involved looking at the effect of variation in parameter
estimates on ability estimation. Unlike the linear case, where a fixed number of calibrations were
available on each item, the number of calibrations varied in this case (as shown in Table 4, the
number of calibrations on various items varied from 8 to 11). Thus, 20 sets of item parameter
estimates were generated for each item by drawing parameters at random from the various
calibrations available for that item (except the first calibration). A response set was obtained by
generating responses for 1,000 examinees at 11 ability levels corresponding to the ability levels
listed in Table 5. These ability levels are obtained on the number-right scale from the reference P
& P base form. The number-right score for this particular test ranged from 0 to 60, resulting in
11 ability levels with a 6-point interval. The ability levels when converted on to the theta metric
ranged from -3.138 to 2.592.

The item parameter estimates used to generate the response set came from the first
calibration and were considered as the baseline estimates. The response set was scored using
baseline parameter estimates from the first calibration, as well as using the 20 other randomly
chosen sets of estimates.
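
The sampling and simulation design can be sketched as follows. This is a hedged illustration, not the study's code: it assumes a three-parameter logistic model with D = 1.7, assumes that the same number of examinees is simulated at every ability level, and uses hypothetical function and argument names.

```python
import math
import random

D = 1.7  # assumed logistic scaling constant; not stated in the report


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def draw_parameter_sets(calibrations, n_sets=20, rng=None):
    """Build n_sets parameter sets by sampling one calibration per item.

    calibrations[i] lists the (a, b, c) estimates for item i in
    calibration order; index 0 (the baseline) is never drawn.
    """
    if rng is None:
        rng = random.Random()
    return [
        [rng.choice(item_cals[1:]) for item_cals in calibrations]
        for _ in range(n_sets)
    ]


def simulate_responses(thetas, items, n_per_level, rng=None):
    """Simulate 0/1 responses for n_per_level examinees at each theta.

    Returns a list of (theta, responses) pairs generated from the
    baseline item parameters in `items`.
    """
    if rng is None:
        rng = random.Random()
    data = []
    for theta in thetas:
        for _ in range(n_per_level):
            responses = [
                int(rng.random() < p3pl(theta, a, b, c)) for a, b, c in items
            ]
            data.append((theta, responses))
    return data
```

Passing a seeded `random.Random` instance makes both the draws and the simulated responses reproducible.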



Table 5

Ability Levels and Corresponding Weights for Quantitative CAT

Level    Ability    Weight
1        -3.138     0.007
2        -1.970     0.057
3        -1.289     0.107
4        -0.772     0.145
5        -0.337     0.163
6         0.054     0.154
7         0.426     0.135
8         0.800     0.110
9         1.210     0.077
10        1.725     0.039
11        2.592     0.005



The first set of scores was then compared to the other 20 sets of scores. RMSEs were
computed between the various sets of ability estimates at each ability level. Since a rectangular
distribution was simulated, the mean sum-of-squares at various ability levels were weighted in
order to compute the overall RMSE. The ability estimates were then converted to scaled or
reported scores and the distributions of differences between those scores obtained using various
sets of estimates were compared. The differences were expressed on the operational or reported
score scale where the reported scores for this particular program are expressed in 1-point
intervals.
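
The weighting step can be illustrated with a short sketch. It takes the mean squared difference in ability estimates at each simulated ability level and combines the levels using the weights listed in Table 5. Dividing by the sum of the weights is an assumption made here because the tabled weights sum to 0.999 rather than exactly 1; the function names are hypothetical.

```python
import math

# Ability-level weights for the Quantitative CAT as listed in Table 5.
CAT_WEIGHT = [0.007, 0.057, 0.107, 0.145, 0.163, 0.154,
              0.135, 0.110, 0.077, 0.039, 0.005]


def level_mse(baseline, alternative):
    """Mean squared difference in ability estimates at one ability level."""
    n = len(baseline)
    return sum((x - y) ** 2 for x, y in zip(baseline, alternative)) / n


def overall_rmse(mse_by_level, weights=CAT_WEIGHT):
    """Weighted overall RMSE across ability levels.

    The weights are renormalized to sum to 1 (an assumption).
    """
    total_weight = sum(weights)
    weighted = sum(w * m for w, m in zip(weights, mse_by_level))
    return math.sqrt(weighted / total_weight)
```

Because equal numbers of examinees are simulated at each level, the weights restore the relative frequency with which each ability level occurs in the reference population.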

Next, the scoring analyses were repeated by generating response data using the item bank
parameters for these items. As mentioned earlier, these parameters were originally obtained from
P & P pretest calibrations. These analyses were expected to reveal more variation in scores due
to P & P context effects in addition to positional effects obtained from adaptive administrations.
In real calibrations, these estimates are used as priors for the corresponding items; hence, it is
important to know whether such context effects influence the parameter estimation. The response
set was then scored using the same set of item parameter estimates as well as the remaining 20
sets of estimates.

Results



The results of the analyses on linearly administered items are presented first, followed by 
the adaptively administered items. 

Results for Case Study 1 

It should be noted that, for the sake of brevity and clarity, results for Quantitative and 
Verbal are presented and discussed side-by-side; however, the authors do not intend to compare 
the two measures. Figure 1 presents the test characteristic curves (TCC) for the set of anchor 
items for the two measures. 




Figure 1. TCC for Quantitative and Verbal anchor items over nine calibrations. 



The TCCs for both measures were extremely close under both scenarios. Some variations 
at the tails of the curve are characteristic of the interaction between the abilities of examinees and 
difficulty level of the items. Those variations are also shown in the plots of ICCs presented in 
Figures A1 and A2 in Appendix A. The plots of ICCs for selected Quantitative and Verbal items 
show that the probabilities of getting an item right did not vary substantially across calibrations, 
except at the very extreme ends of the scale. The investigation of the general trends did not 
exhibit any directional change in the estimates. In other words, none of the items exhibited a 
systematic decrease or increase in the parameter estimates over repeated calibrations. 

The RMSEs among ICCs for the two measures are shown in Figure 2. The actual values of
the weighted RMSEs are presented in Tables A2 and A3. The RMSEs indicated a small variation
between calibrations for both Quantitative and Verbal measures. The differences were slightly
higher for Quantitative, especially for some of the items. The most variable item in the
Quantitative measure was one with a very high difficulty level. Inspecting the sample sizes
and ability distribution for that particular calibration of that item did not suggest any explanation
beyond chance-level differences in responding for examinees at extreme ability levels.





Figure 2. Weighted RMSEs between ICCs between first calibration and the others. 



In comparing the model-based vs. empirical variation (see Figures A3 and A4), it was
found that the model-based variation was larger than the empirical variation for both
Quantitative and Verbal measures for the b-parameter. The model-based variance was highly
affected by the magnitude of the b-parameter: very low b-parameters resulted in large values of
model-based variance. In the case of the a-parameter, model-based variance was larger than the
empirical variation for the Quantitative measure while smaller for the Verbal measure. The
a-parameters for Verbal were, in general, higher in magnitude.

In general, the results did indicate very small model-based and empirical variation in both
a- and b-parameters, except for the model-based variance in the b-parameter for Quantitative. As
these items provide very little information, the extremely low b-parameters for some of the
Quantitative items caused this variance. In general, these results should be interpreted with
caution, as the samples for the analyses were not suitably sized.

The results of scoring using the different sets of parameter estimates are presented in
Figures 3 through 6. The results indicated that the RMSEs in ability estimates ranged from 0.13 to
0.33 for Quantitative and 0.13 to 0.34 for Verbal where a response set used in a calibration was
scored by item parameters obtained from different response sets (81 cases). Results of two such
scenarios are shown in Figure 3, where Quantitative response sets 1 and 9, respectively, are scored
using item parameter estimates obtained from each of the other response sets. The figures show
that the error in estimates, when scored using different sets of parameter estimates, remained fairly
consistent across calibrations. Similar scenarios for the Verbal measure are presented in Figure 4. The
differences in examinee reported scores, when scored using different sets of item parameter
estimates, remained limited to a 0-20 point difference on the reported score scale for the majority of
the examinees (as mentioned before, the reported scale for this particular program is expressed in
10-point intervals). Of the examinees, 83% to 98% (91% on average) exhibited a 0-20 point score
difference for the Quantitative measure. Figure 5 illustrates this result for two typical cases.



[Figure 3 panels: QNT RMSE (Cal. 1 vs. others); QNT RMSE (Cal. 9 vs. others). Horizontal axis: Calibrations.]



Figure 3. RMSEs between ability estimates on a response set scored by its own and other
sets of item parameter estimates (Quantitative).

Figure 4. RMSEs between ability estimates on a response set scored by its own and other
sets of item parameter estimates (Verbal).



The first part of Figure 5 shows response set 2 scored using item parameter estimates
obtained on response set 1. The second part shows response set 9 scored using item parameter
estimates obtained on response set 5. In the first scenario, 93% of the examinees exhibited a
0-20 point difference in their reported scores, while 90% showed this difference for the second
scenario. Similar results for Verbal are shown in Figure 6. The percentages of examinees
exhibiting 0-20 point score differences ranged from 87% to 98% (94% on average) for Verbal.
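
The percentages reported here are simple proportions of examinees whose two reported scores differ by no more than a given amount. A minimal sketch of that tabulation follows; the function names are hypothetical and the scores in the usage note are invented for illustration, not taken from the study.

```python
from collections import Counter


def share_within(own_scores, other_scores, tolerance=20):
    """Proportion of examinees whose reported scores differ by <= tolerance."""
    diffs = [abs(o - s) for s, o in zip(own_scores, other_scores)]
    return sum(d <= tolerance for d in diffs) / len(diffs)


def difference_counts(own_scores, other_scores):
    """Frequency distribution of signed reported-score differences."""
    return Counter(o - s for s, o in zip(own_scores, other_scores))
```

For instance, with made-up scores `[500, 510, 600, 640]` and `[500, 530, 570, 640]`, three of the four examinees fall within 20 points, so `share_within` returns 0.75.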




Figure 5. Frequency distribution of reported score differences for Quantitative.

Figure 6. Frequency distribution of reported score differences for Verbal.



Since the average Standard Error of Measurement (SEM) ranges from 35 to 45 points for
the Quantitative measure and 30 to 40 points for the Verbal measure of this particular program,
score differences of 0-20 points fall within one SEM, and the results were promising for both
measures.

Results for Case Study 2 

The ICC plots for selected adaptively administered items are presented in Figure B1 in
Appendix B. The weighted RMSEs between ICCs for the adaptive calibrations in comparison with
the first adaptive calibration are presented in Figure 7. These values are also presented in Tables B1
and B2 for readers' interest. The RMSEs among ICCs for those items revealed remarkably small
variation. The values remained between 0.01 and 0.20 for all items across all calibrations.
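
As an illustration of the statistic discussed above, the following Python sketch computes a weighted RMSE between two ICCs for the same item. It rests on assumptions that are not confirmed by this section: a three-parameter logistic model with scaling constant D = 1.7, an evenly spaced theta grid, and weights proportional to a standard normal density. The function names and the example parameter values are hypothetical.

```python
import numpy as np

D = 1.7  # conventional logistic scaling constant (assumed here)

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under an assumed 3PL model."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def weighted_rmse_icc(params_1, params_2, theta=None, weights=None):
    """Weighted RMSE between two ICCs given as (a, b, c) tuples.

    By default the weights follow a standard normal shape over the theta
    grid; this weighting is an assumption, not the study's documented choice.
    """
    if theta is None:
        theta = np.linspace(-4.0, 4.0, 81)
    if weights is None:
        weights = np.exp(-0.5 * theta ** 2)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    diff = icc_3pl(theta, *params_1) - icc_3pl(theta, *params_2)
    return float(np.sqrt(np.sum(weights * diff ** 2)))

# Hypothetical estimates for one item from two calibrations (not study data).
first_calibration = (1.0, 0.0, 0.2)
later_calibration = (1.1, 0.1, 0.2)
print(round(weighted_rmse_icc(first_calibration, later_calibration), 3))
```

Weighting by the ability distribution means that discrepancies between curves count most where examinees are concentrated, so a large gap in a sparsely populated tail contributes little to the statistic.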

When compared with scores based on the first calibration, the differences in reported scores
for the adaptive case ranged from 0 to 2 points for 90% to 98% (96% on average) of examinees
across the 20 item parameter sets that were drawn from the calibrations. At this point it is worth
mentioning again that the reported score scale for this particular program is expressed in 1-point
intervals.

The consistency of the RMSEs in the ability estimates across the 20 item parameter sets
drawn from the available calibrations is depicted in Figure 8. When investigated by ability level
(Figure 9), a large portion of the error seemed to concentrate at the low ability levels.
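
A breakdown of this kind can be sketched by grouping examinees into ability intervals and computing the RMSE within each interval. The Python sketch below is illustrative only: the function name, the bin edges, and the theta values are hypothetical, and the sketch uses a plain (unweighted) RMSE within each bin rather than whatever weighting the study applied.

```python
import numpy as np

def rmse_by_ability_level(theta_own, theta_other, bin_edges):
    """RMSE between two sets of ability estimates, grouped by the level of
    the 'own' estimate. Bin k covers bin_edges[k-1] <= theta < bin_edges[k];
    bin 0 lies below the first edge and the last bin above the final edge."""
    own = np.asarray(theta_own, dtype=float)
    other = np.asarray(theta_other, dtype=float)
    squared_error = (own - other) ** 2
    bin_index = np.digitize(own, bin_edges)
    result = {}
    for k in range(len(bin_edges) + 1):
        in_bin = bin_index == k
        if in_bin.any():  # skip empty ability intervals
            result[k] = float(np.sqrt(squared_error[in_bin].mean()))
    return result

# Hypothetical ability estimates for five examinees (not study data).
own = [-2.0, -1.5, 0.0, 0.2, 2.0]
other = [-1.0, -1.5, 0.0, 0.2, 2.5]
print(rmse_by_ability_level(own, other, bin_edges=[-1.0, 1.0]))
```

Grouping by the "own set" estimate mirrors the figure captions, in which the response set's own calibration serves as the reference.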
Figure 7. Weighted RMSEs between CAT ICCs from the first calibration and the others.
Figure 8. Weighted RMSEs between ability estimates on a response set scored by its own
and another set of item parameter estimates (own set = 1st calibration).
Figure 9. Weighted RMSEs between ability estimates on a response set scored by its own
and another set of item parameters by ability level (own set = 1st calibration).




Figure 10. Frequency distribution of reported score differences — comparison with 1st CBT 
calibration. 
When the P & P calibrated estimates of the items were used in place of the first calibration
as the basis for comparison, the results were quite different. Figure 11 shows the
RMSEs between theta estimates obtained on the P & P calibrated sets of parameter estimates and
the 20 sets of estimates obtained on CBT calibrations. The results indicate an increase in overall
RMSEs when abilities obtained using P & P estimates were used for comparison. In the
scenario where comparisons were based on the 1st calibration, the overall RMSEs between
scores ranged from 0.12 to 0.20; here the errors ranged from 0.19 to 0.30. When compared
across ability levels, the errors in the scores remained small at the middle ability levels, were
higher at the high ability levels, and were highest at the low ability levels, where they reached
0.63.



[Figure 11 plot: Weighted RMSEs between scores (bank parameter set vs. 20 sets).]

Figure 11. RMSEs between ability estimates on a response set scored by its own and other
sets of item parameter estimates (own set = P & P bank parameters).



The percentage of examinees that exhibited reported score differences of 0-2
points on the reported score scale ranged from 87% to 94% (91% on average). This percentage
was considerably smaller than in the previous scenario, where most of the cases resulted in more
than 93% of the examinees exhibiting a 0-2 point difference. In other words, the percentage of
examinees whose scores changed by more than 2 points was noticeably larger in this case.
Figure 12. Weighted RMSEs between ability estimates on a response set scored by its own
and another set of item parameters by ability level (own set = P & P bank parameters).




Figure 13. Frequency distribution of reported score differences — comparison with P & P 
calibrated parameters. 
The SEM for the Quantitative measure of this program usually ranges from 2.5 to 3.5
points. In the first scenario, 96% of examinees exhibited a difference of 2 points or less, which
is an encouraging result. In other words, 4% of the examinees exhibited a difference of
three or more points in their reported scores. For the second scenario, where bank parameters were
used to score the responses, 9% of the examinees showed a difference of three or more points.

Conclusions 

The studies discussed in this paper investigated the stability of item parameter
estimates in the current CBT calibrations. The results of the study will serve as a baseline for
the design work involved in creating models for automated item generation. The concept of
having a single model to generate a family of items should be informed by knowing the relative
stability of the parameter estimates when calibrated online.

Several conclusions can be drawn from the results of this study. The linearly
administered items in a high stakes testing program exhibited remarkably small variation in
parameter estimates over repeated calibrations. Although the sample sizes upon which the
calibrations were performed varied considerably, the results were not affected. As long as the
sample sizes are large enough to calibrate, stable results are produced. Similar findings with
adaptively administered items in another high stakes testing program were also found when
initial adaptively based item parameter estimates were compared with estimates from repeated
use. These findings have implications for research on item modeling because they suggest that
the use of item modeling with operational CAT programs will introduce more variation in ability
estimation due to item context effects, positional effects, and the small sample sizes obtained for
some items. It will be important to quantify and account for these sources of variation as this
research progresses.

The results of this study also indicate that context effects played a more significant role in
adaptive item parameters when the comparisons were made to the parameters that were obtained
from P & P testing. Even though PARSCALE was used to calibrate both sets of items,
P & P items went through concurrent calibrations as opposed to the item-specific prior methodology
used for adaptive items; this fact may also have caused some variation. This suggests that the
parameter estimates obtained on P & P administrations should be replaced,
whenever feasible, with the CBT calibrated parameters. The approach employed for this paper
(i.e., freeing the item-specific priors that constrain item parameter estimates for selected
operational items during the process of pretest item calibration) is one possible alternative for
this kind of updating. However, further research would be necessary to determine if this
approach would be feasible in the context of an ongoing, operational CAT program.
Appendix A



Table A1

Sample Sizes for Each Calibration

Measure  Calibration  Total sample  Act. sample size  # of perfect  Final sample
                                    per anchor item   scores
QNT      1                   6,656             1,299             8         1,291
QNT      2                  10,178             1,420            15         1,405
QNT      3                  16,311             1,182             7         1,175
QNT      4                  20,018             1,115            11         1,104
QNT      5                   6,038               833             8           825
QNT      6                  17,949             1,432             6         1,426
QNT      7                  19,863             2,323            18         2,305
QNT      8                  16,493               858            14           844
QNT      9                  20,422               636             9           627
VBL      1                  13,632             2,287             3         2,284
VBL      2                   8,774             1,066             2         1,064
VBL      3                  13,329               992             0           992
VBL      4                  14,697             1,118             4         1,114
VBL      5                  15,151             1,047             2         1,045
VBL      6                  11,026               876             3           873
VBL      7                   2,130             1,569             2         1,567
VBL      8                   5,869               834             4           830
VBL      9                  24,945               939             2           937
Figure A1. ICCs for four Quantitative items over nine calibrations.
Figure A2. ICCs for four Verbal items over nine calibrations.

Table A2

Weighted RMSEs in ICCs for Quantitative Measure

         Calibration
Item        2     3     4     5     6     7     8     9
1        0.04  0.01  0.02  0.02  0.02  0.02  0.03  0.03
2        0.03  0.01  0.03  0.02  0.02  0.02  0.03  0.04
3        0.03  0.02  0.04  0.02  0.03  0.04  0.03  0.03
4        0.03  0.03  0.04  0.04  0.01  0.02  0.02  0.04
5        0.03  0.03  0.01  0.03  0.02  0.01  0.03  0.06
6        0.02  0.02  0.02  0.02  0.02  0.01  0.02  0.04
7        0.03  0.02  0.03  0.02  0.02  0.01  0.05  0.02
8        0.01  0.01  0.00  0.02  0.02  0.02  0.04  0.02
9        0.02  0.02  0.04  0.01  0.03  0.02  0.02  0.03
10       0.03  0.03  0.04  0.02  0.03  0.02  0.03  0.04
11       0.01  0.02  0.04  0.04  0.03  0.03  0.04  0.02
12       0.02  0.02  0.01  0.02  0.02  0.03  0.05  0.03
13       0.01  0.02  0.04  0.03  0.02  0.03  0.03  0.03
14       0.02  0.02  0.01  0.03  0.02  0.03  0.03  0.04
15       0.04  0.03  0.03  0.06  0.04  0.03  0.05  0.03
16       0.04  0.05  0.04  0.02  0.04  0.02  0.06  0.05
17       0.03  0.02  0.03  0.02  0.02  0.01  0.04  0.01
18       0.05  0.04  0.05  0.04  0.02  0.03  0.03  0.04
19       0.03  0.04  0.08  0.02  0.03  0.02  0.02  0.02
20       0.01  0.03  0.03  0.01  0.03  0.01  0.03  0.04
21       0.02  0.02  0.02  0.04  0.03  0.03  0.05  0.02
22       0.02  0.03  0.00  0.03  0.03  0.01  0.05  0.03
23       0.02  0.03  0.01  0.04  0.02  0.02  0.03  0.02
24       0.02  0.01  0.01  0.01  0.03  0.01  0.03  0.07
25       0.01  0.04  0.02  0.04  0.03  0.03  0.03  0.02
26       0.02  0.02  0.01  0.02  0.03  0.02  0.01  0.02
27       0.02  0.04  0.04  0.05  0.04  0.01  0.03  0.02
28       0.02  0.03  0.04  0.04  0.03  0.03  0.02  0.02
Average  0.02  0.02  0.03  0.03  0.03  0.02  0.03  0.03
Table A3

Weighted RMSEs in ICCs for Verbal Measure

         Calibration
Item        2     3     4     5     6     7     8     9
1        0.01  0.03  0.04  0.01  0.02  0.02  0.02  0.02
2        0.03  0.02  0.03  0.01  0.03  0.03  0.03  0.03
3        0.03  0.03  0.05  0.06  0.05  0.06  0.07  0.08
4        0.04  0.05  0.06  0.06  0.07  0.05  0.04  0.08
5        0.01  0.05  0.02  0.00  0.04  0.03  0.02  0.04
6        0.03  0.05  0.06  0.05  0.04  0.07  0.06  0.06
7        0.03  0.01  0.01  0.01  0.00  0.01  0.03  0.02
8        0.07  0.07  0.01  0.05  0.02  0.02  0.07  0.01
9        0.03  0.02  0.04  0.05  0.06  0.04  0.06  0.09
10       0.02  0.02  0.04  0.04  0.02  0.03  0.05  0.05
11       0.00  0.01  0.01  0.02  0.02  0.01  0.01  0.03
12       0.03  0.03  0.02  0.04  0.03  0.03  0.05  0.03
13       0.01  0.02  0.01  0.03  0.01  0.01  0.04  0.03
14       0.02  0.03  0.02  0.05  0.02  0.04  0.04  0.04
15       0.01  0.02  0.04  0.01  0.03  0.02  0.04  0.01
16       0.03  0.06  0.03  0.04  0.03  0.07  0.04  0.03
17       0.04  0.04  0.03  0.02  0.06  0.02  0.02  0.03
18       0.02  0.01  0.02  0.02  0.02  0.03  0.03  0.02
19       0.05  0.04  0.02  0.02  0.04  0.04  0.02  0.04
20       0.00  0.02  0.04  0.05  0.03  0.06  0.02  0.06
21       0.03  0.03  0.03  0.03  0.05  0.03  0.02  0.04
22       0.03  0.01  0.02  0.01  0.04  0.02  0.02  0.01
23       0.02  0.01  0.03  0.03  0.03  0.02  0.04  0.02
24       0.03  0.04  0.05  0.05  0.06  0.03  0.03  0.05
25       0.02  0.02  0.03  0.02  0.02  0.02  0.02  0.01
26       0.03  0.02  0.03  0.03  0.02  0.03  0.05  0.04
27       0.03  0.03  0.04  0.03  0.02  0.02  0.04  0.04
28       0.02  0.03  0.04  0.03  0.05  0.03  0.08  0.06
29       0.04  0.06  0.07  0.05  0.07  0.05  0.06  0.04
30       0.10  0.03  0.04  0.07  0.06  0.05  0.05  0.07
Average  0.03  0.03  0.03  0.03  0.04  0.03  0.04  0.04
Figure A3. Model-based vs. empirical average variance for a- and b-parameters.

Figure A4. Model-based vs. empirical average variance for a- and b-parameters after
deleting two very easy Quantitative items.

Figure A5. Typical ability distributions for Quantitative and Verbal measures.

Appendix B

Results for Adaptively Administered Items

Figure B1. ICCs for four Quantitative CAT items.
Figure B2. Weighted RMSEs in ICCs for CAT items on Quantitative measure.

Table B1

Weighted RMSEs in ICCs for CAT Items on Quantitative Measure

         Calibration
Item        2     3     4     5     6     7     8   9^a  10^a  11^a
1        0.03  0.03  0.03  0.06  0.03  0.04  0.05  0.06  0.01  0.05
2        0.01  0.01  0.02  0.05  0.03  0.04  0.04
3        0.00  0.02  0.07  0.03  0.01  0.00  0.01  0.03  0.03
4        0.01  0.05  0.03  0.02  0.07  0.04  0.03  0.05  0.11
5        0.07  0.04  0.01  0.04  0.01  0.06  0.04  0.07  0.03
6        0.03  0.03  0.02  0.03  0.01  0.02  0.03  0.04  0.05  0.07
7        0.04  0.06  0.04  0.03  0.05  0.08  0.02  0.04  0.06  0.06
8        0.03  0.03  0.03  0.04  0.03  0.02  0.02  0.02  0.03
9        0.07  0.05  0.03  0.04  0.05  0.03  0.02  0.03  0.05  0.06
10       0.02  0.04  0.02  0.02  0.02  0.00  0.02  0.03
11       0.06  0.06  0.06  0.06  0.05  0.07  0.07  0.05  0.05
12       0.02  0.04  0.05  0.04  0.04  0.07  0.04  0.06  0.10
13       0.07  0.04  0.06  0.10  0.09  0.06  0.10
14       0.02  0.03  0.04  0.06  0.08  0.04  0.03  0.08
15       0.03  0.07  0.04  0.06  0.08  0.06  0.06  0.05  0.05  0.01
16       0.08  0.03  0.06  0.01  0.04  0.01  0.08  0.02
17       0.02  0.02  0.01  0.04  0.02  0.03  0.06  0.06
18       0.02  0.02  0.04  0.02  0.04  0.00  0.05  0.04  0.02
19       0.03  0.03  0.02  0.03  0.01  0.02  0.03  0.06  0.03
20       0.01  0.04  0.06  0.02  0.05  0.04  0.03
21       0.05  0.03  0.05  0.13  0.03  0.01  0.02
22       0.04  0.02  0.06  0.03  0.04  0.01  0.04  0.02  0.04
23       0.01  0.03  0.04  0.03  0.07  0.01  0.03  0.05  0.03
24       0.05  0.17  0.05  0.06  0.02  0.02  0.04  0.03
25       0.03  0.08  0.03  0.03  0.03  0.04  0.02  0.01  0.03
26       0.04  0.03  0.01  0.04  0.02  0.02  0.07  0.08
27       0.08  0.05  0.04  0.03  0.05  0.02  0.04
28       0.01  0.03  0.01  0.03  0.02  0.00  0.01  0.05
29       0.04  0.02  0.04  0.03  0.03  0.05  0.03  0.04  0.04  0.05
30       0.01  0.10  0.06  0.20  0.02  0.09  0.04  0.07  0.08

^a Some cells are empty because the number of calibrations varied from 8 to 11 for different items.
Table B2

Weighted RMSEs in ICCs for CAT Items on Quantitative Measure (P & P or Bank Parameter
Estimates)

         Calibration
Item        1     2     3     4     5     6     7     8   9^a  10^a  11^a
1        0.09  0.06  0.08  0.07  0.05  0.09  0.07  0.06  0.06  0.08  0.05
2        0.02  0.01  0.01  0.03  0.06  0.02  0.05  0.04
3        0.03  0.03  0.05  0.08  0.05  0.04  0.03  0.03  0.06
4        0.11  0.11  0.15  0.13  0.10  0.17  0.15  0.13  0.15
5        0.03  0.05  0.04  0.02  0.02  0.03  0.04  0.02  0.06
6        0.04  0.06  0.01  0.03  0.06  0.05  0.05  0.07  0.08  0.09  0.10
7        0.07  0.02  0.01  0.04  0.04  0.05  0.03  0.06  0.03  0.03  0.05
8        0.06  0.06  0.06  0.06  0.06  0.04  0.06  0.07  0.05  0.05
9        0.06  0.03  0.05  0.06  0.06  0.03  0.07  0.06  0.08  0.05  0.04
10       0.04  0.04  0.08  0.06  0.03  0.03  0.04  0.07  0.03
11       0.05  0.01  0.02  0.01  0.04  0.04  0.04  0.05  0.04
12       0.10  0.10  0.07  0.08  0.06  0.10  0.06  0.07  0.04
13       0.06  0.02  0.03  0.03  0.05  0.06  0.02  0.05
14       0.12  0.13  0.15  0.16  0.17  0.19  0.14  0.15  0.20
15       0.06  0.09  0.13  0.08  0.11  0.14  0.10  0.11  0.11  0.02  0.06
16       0.04  0.06  0.03  0.04  0.05  0.02  0.03  0.06  0.03
17       0.08  0.10  0.10  0.09  0.05  0.10  0.10  0.11  0.13
18       0.07  0.06  0.06  0.05  0.08  0.06  0.08  0.04  0.05  0.06
19       0.03  0.02  0.04  0.03  0.05  0.02  0.02  0.01  0.05  0.03
20       0.05  0.05  0.03  0.06  0.05  0.04  0.01  0.03
21       0.05  0.05  0.02  0.02  0.09  0.05  0.04  0.06
22       0.07  0.05  0.05  0.03  0.04  0.04  0.06  0.03  0.05  0.03
23       0.06  0.08  0.10  0.10  0.09  0.13  0.06  0.10  0.11  0.10
24       0.14  0.18  0.12  0.19  0.13  0.12  0.15  0.18  0.16
25       0.10  0.07  0.12  0.12  0.12  0.09  0.06  0.08  0.11  0.06
26       0.11  0.13  0.10  0.12  0.08  0.10  0.12  0.16  0.04
27       0.06  0.11  0.10  0.08  0.08  0.02  0.07  0.08
28       0.05  0.04  0.07  0.06  0.06  0.06  0.05  0.06  0.08
29       0.07  0.06  0.07  0.05  0.04  0.06  0.07  0.03  0.04  0.03  0.03
30       0.11  0.11  0.18  0.15  0.24  0.13  0.14  0.13  0.13  0.10

^a Some cells are empty because the number of calibrations varied from 8 to 11 for different items.